Cancer is one of the leading causes of death in the USA and worldwide. Several factors might affect the cancer mortality rate, such as the incidence rate or a person's gender and age. According to cancer.gov, we can study trends in cancer mortality and incidence rates to see whether treatments are improving: a mortality rate that falls while the incidence rate does not suggests better outcomes after diagnosis. In this data analysis, we want to learn whether the cancer mortality rate, focusing on deaths caused by lung cancer in the USA, is correlated with socioeconomic factors such as age, income, and poverty status.
There are two big questions we want to answer in this data analysis.
First, we want to examine the relationship between the lung cancer mortality rate and the incidence rate. Through this analysis, we can check whether the likelihood of dying once diagnosed with lung cancer is consistent across counties.
Second, we want to learn whether socioeconomic factors affect the cancer mortality rate across US counties. The chosen factors are:
1. Median age: According to cancer.gov, the median age at lung cancer diagnosis is 70. Since the median age varies across counties, we are interested in how age affects the incidence rate in each county.
2. Income: Income plays an important role in determining standard of living, which affects the chance of getting cancer. We want to know whether income has a positive or negative relationship with incidence rates.
3. Poverty status: Similar to income, poverty status affects the standard of living. We want to know whether poverty status has a significant relationship with incidence rates.
4. Health insurance: Early diagnosis and treatment might reduce the cancer mortality rate. We want to know whether that relationship holds for lung cancer.
The two .csv files given in the dataset, death.csv and incd.csv, were not enough for this data analysis. With further research, I found more information and created four extra .csv files with data on age/sex, income, insurance, and poverty. (R code used to build these files, respectively: Data-AgeSex.R, Data-Income.R, Data-Insurance.R, Data-Poverty.R)
Importing the data:
wd <- "/Users/PhuongAnh/STAT350-Project"
setwd(wd)
death = read.csv("death.csv")
incident = read.csv("incd.csv")
agesex = read.csv("agesex.csv")
income = read.csv("income.csv")
insurance = read.csv("insurance.csv")
poverty = read.csv("poverty.csv")
Merge by FIPS and rename the data. The variables are:
FIPS: The 5-digit code whose first 2 digits are the state code and last 3 digits are the county code
County: Name of the county and state
Total.Population: Total population in the county
Total.Male: Total male population in the county
Total.Female: Total female population in the county
Median.Age: Median Age by county
Median.Age_Male: Male Median Age by county
Median.Age_Female: Female Median Age by county
Median.Income: Median Income by county
Insurance.Coverage: Number of people covered by at least one type of health insurance
Poverty: Number of people with poverty status
Average.Deaths.per.Year: Average annual death count by county
Age.Adjusted.Death.Rate: Age Adjusted Death Rate per 100,000
Death.Rates.Trend: Death Rate Trends
Age.Adjusted.Incidence.Rate: Age Adjusted Incidence Rate per 100,000
Incidence.Rate.Trend: Incidence Rate Trend
Average.Annual.Count: Average annual incidence count by county
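As a quick illustration of the FIPS structure described above (the example codes here are arbitrary, not drawn from the dataset):

```r
# Split a 5-digit FIPS code into its state and county parts.
fips  <- c(1001, 17031, 48301)
fips5 <- sprintf("%05d", fips)          # zero-pad to 5 digits
state_code  <- substr(fips5, 1, 2)      # first 2 digits: state
county_code <- substr(fips5, 3, 5)      # last 3 digits: county
data.frame(fips5, state_code, county_code)
```

Zero-padding matters: FIPS codes stored as numbers silently drop leading zeros (e.g., Alabama's 01001 becomes 1001).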
agesex = agesex[c("Total.Population","Total.Male","Total.Female","Median.Age", "Median.Age_Male", "Median.Age_Female","FIPS")]
income = income[c("Median.Income", "FIPS")]
insurance = insurance[c("Insurance.Coverage","FIPS")]
poverty = poverty[c("Poverty","FIPS")]
data = merge(x= agesex, y=income, by="FIPS", all=TRUE)
data = merge(x= data, y=insurance, by="FIPS", all=TRUE)
data = merge(x= data, y=poverty, by="FIPS", all=TRUE)
death = death[c("County","FIPS","Age.Adjusted.Death.Rate","Average.Deaths.per.Year","Recent.5.Year.Trend..2..in.Death.Rates")]
incident = incident[c("FIPS","Age.Adjusted.Incidence.Rate...cases.per.100.000","Average.Annual.Count","Recent.5.Year.Trend.in.Incidence.Rates")]
ALLDATA = merge(x=death, y=incident, by="FIPS", all=TRUE)
ALLDATA = merge(x=ALLDATA, y=data, by="FIPS", all=TRUE)
ALLDATA = ALLDATA[!is.na(ALLDATA$County), ]   # drop rows with no county match
colnames(ALLDATA)[5] <- "Death.Rates.Trend"
colnames(ALLDATA)[6] <- "Age.Adjusted.Incidence.Rate"
colnames(ALLDATA)[8] <- "Incidence.Rate.Trend"
For FIPS = 0 (the USA row), we fill each total column with the sum over all counties: Total Population, Total Male, Total Female, Insurance Coverage, Poverty.
ALLDATA$Average.Deaths.per.Year <- as.numeric((ALLDATA$Average.Deaths.per.Year))
ALLDATA$Average.Annual.Count <- as.numeric((ALLDATA$Average.Annual.Count))
ALLDATA[1, "Total.Population"]   = sum(ALLDATA[-1, "Total.Population"], na.rm=TRUE)
ALLDATA[1, "Total.Male"]         = sum(ALLDATA[-1, "Total.Male"], na.rm=TRUE)
ALLDATA[1, "Total.Female"]       = sum(ALLDATA[-1, "Total.Female"], na.rm=TRUE)
ALLDATA[1, "Insurance.Coverage"] = sum(ALLDATA[-1, "Insurance.Coverage"], na.rm=TRUE)
ALLDATA[1, "Poverty"]            = sum(ALLDATA[-1, "Poverty"], na.rm=TRUE)
The population varies from county to county, so it is easier to work with percentages.
Male Percent: Total Male/Total Population
Poverty_Percent: Total Poverty Status/Total Population
Insurance_Percent: Total People with Insurance/Total Population
Death_Rate: Average Deaths per Year/Total Population
Incidence_Rate: Average Incidence per Year/Total Population
ALLDATA$Male_Percent = ALLDATA$Total.Male/ALLDATA$Total.Population
ALLDATA$Female_Percent = ALLDATA$Total.Female/ALLDATA$Total.Population
ALLDATA$Poverty_Percent = ALLDATA$Poverty/ALLDATA$Total.Population
ALLDATA$Insurance_Percent = ALLDATA$Insurance.Coverage/ALLDATA$Total.Population
ALLDATA$Death_Rate = ALLDATA$Average.Deaths.per.Year/ALLDATA$Total.Population
ALLDATA$Incidence_Rate = ALLDATA$Average.Annual.Count/ALLDATA$Total.Population
Looking at our data, there are 6 data points with NA death rates, because there is no population data for these 6 counties. We remove these 6 points because the mortality rate is the target variable.
CLEANDATA = ALLDATA[!is.na(ALLDATA$Death_Rate), ]
CLEANDATA$Average.Deaths.per.Year <- as.numeric((CLEANDATA$Average.Deaths.per.Year))
Remove FIPS 15005 and 48301 because their Incidence_Rate > 1, which suggests a data inconsistency.
CLEANDATA = CLEANDATA[!(CLEANDATA$FIPS %in% c("15005", "48301")), ]
Remove entries with an Average Deaths per Year of 16 or fewer. According to the dataset documentation, counties showing an "*" for the age-adjusted death rate are suppressed because of low counts. These suppressed values may bias our model.
CLEANDATA = CLEANDATA[!(CLEANDATA$Average.Deaths.per.Year %in% 1:16), ]
Add a data point that is the average of all columns, excluding the FIPS = 0 (USA total) row. With over 3000 counties of widely varying population, I want a data point that is the mean of everything; if this point turns out to be an outlier, the data are skewed.
library(dplyr)
CLEANDATA.noUS=CLEANDATA[-c(1),]
Mean = summarize_all(CLEANDATA.noUS, mean, na.rm=TRUE)
finaldata = rbind(CLEANDATA,Mean)
First look at the data: create a pairs plot to see the relationships between the variables of interest.
pairs(~Death_Rate+Median.Age+Median.Income+Insurance_Percent+Male_Percent+Incidence_Rate+Poverty_Percent, data=finaldata)
We can see some linear relationships between pairs of variables, such as Insurance Percent and Male Percent, and Death Rate and Incidence Rate.
To analyze the data, we will use these methods:
Linear Regression Model:
Backward Elimination: We will use backward elimination to choose an optimal model for Death Rate versus the other factors. There are many factors to consider; with backward selection, we fit all variables and then eliminate, one at a time, the factor that does not meet the significance requirement. We will use the ANOVA table to get the F statistic for each regressor. Upon completing backward elimination, we will know which factors are most significant for our data.
Multicollinearity: We would like to see whether the coefficients for the factors are independent of each other. We will compute the variance inflation factor (VIF) for each factor to measure the combined effect of the linear dependencies among the regressors.
Influential observations: We will use the hat values hii to compute the leverage of each observation, and Cook's distance to find influential data points.
Residual Analysis: With this analysis, we want to see whether the linear model is appropriate and hence the regressors are valid. We will perform a variance-stabilizing transformation if necessary.
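The backward-elimination mechanics can be sketched on simulated data (all variable names below are invented for illustration; they are not the project's):

```r
# Backward elimination sketch: fit the full model, then drop the weakest
# regressor step by step. drop1() gives the partial F-test for each term;
# step() automates the elimination (using AIC).
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)   # x3 is pure noise
y  <- 1 + 2 * x1 - 1.5 * x2 + rnorm(n)
full <- lm(y ~ x1 + x2 + x3)
drop1(full, test = "F")                  # F statistic for removing each term
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)                         # the informative regressors remain
```

The same idea applies below: we start from the full model with all socioeconomic factors and remove the weakest regressor.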
1. LINEAR REGRESSION MODEL
First, I want to examine the relationship between Incidence Rate and Death Rate.
Death.Incd.lm = lm(Death_Rate~Incidence_Rate, data=finaldata)
plot(Death.Incd.lm)
summary(Death.Incd.lm)
##
## Call:
## lm(formula = Death_Rate ~ Incidence_Rate, data = finaldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.012556 -0.002961 -0.002643 -0.001307 0.060309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0028604 0.0001777 16.09 <2e-16 ***
## Incidence_Rate 0.6447581 0.0110380 58.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.007792 on 2693 degrees of freedom
## Multiple R-squared: 0.5589, Adjusted R-squared: 0.5587
## F-statistic: 3412 on 1 and 2693 DF, p-value: < 2.2e-16
anova(Death.Incd.lm)
## Analysis of Variance Table
##
## Response: Death_Rate
## Df Sum Sq Mean Sq F value Pr(>F)
## Incidence_Rate 1 0.20718 0.207181 3412 < 2.2e-16 ***
## Residuals 2693 0.16352 0.000061
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Linear Regression Model for Mortality Rate and Other Factors.
Death.Factor.lm1 = lm(Death_Rate~Median.Age+Median.Income+Insurance_Percent+Male_Percent+Poverty_Percent, data=finaldata)
plot(Death.Factor.lm1)
summary(Death.Factor.lm1)
##
## Call:
## lm(formula = Death_Rate ~ Median.Age + Median.Income + Insurance_Percent +
## Male_Percent + Poverty_Percent, data = finaldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.023195 -0.006646 -0.003743 0.002399 0.067951
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.004e-01 1.621e-02 -6.192 6.89e-10 ***
## Median.Age 6.113e-04 5.031e-05 12.152 < 2e-16 ***
## Median.Income -2.194e-07 3.223e-08 -6.808 1.23e-11 ***
## Insurance_Percent 2.951e-02 8.796e-03 3.355 0.000805 ***
## Male_Percent 1.360e-01 1.654e-02 8.224 3.09e-16 ***
## Poverty_Percent -1.346e-02 6.601e-03 -2.040 0.041496 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01094 on 2552 degrees of freedom
## (137 observations deleted due to missingness)
## Multiple R-squared: 0.1544, Adjusted R-squared: 0.1528
## F-statistic: 93.22 on 5 and 2552 DF, p-value: < 2.2e-16
anova(Death.Factor.lm1)
## Analysis of Variance Table
##
## Response: Death_Rate
## Df Sum Sq Mean Sq F value Pr(>F)
## Median.Age 1 0.032909 0.032909 275.1743 < 2.2e-16 ***
## Median.Income 1 0.008848 0.008848 73.9891 < 2.2e-16 ***
## Insurance_Percent 1 0.004489 0.004489 37.5346 1.037e-09 ***
## Male_Percent 1 0.009000 0.009000 75.2533 < 2.2e-16 ***
## Poverty_Percent 1 0.000497 0.000497 4.1598 0.0415 *
## Residuals 2552 0.305198 0.000120
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2. BACKWARD ELIMINATION REGRESSION
Eliminate "Poverty_Percent", the least significant regressor, from the model.
Death.Factor.lm2 = lm(Death_Rate~Median.Age+Median.Income+Insurance_Percent+Male_Percent, data=finaldata)
plot(Death.Factor.lm2)
summary(Death.Factor.lm2)
##
## Call:
## lm(formula = Death_Rate ~ Median.Age + Median.Income + Insurance_Percent +
## Male_Percent, data = finaldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.023838 -0.006619 -0.003636 0.002128 0.067459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.094e-01 1.560e-02 -7.011 3.01e-12 ***
## Median.Age 6.609e-04 4.409e-05 14.990 < 2e-16 ***
## Median.Income -1.654e-07 1.838e-08 -9.002 < 2e-16 ***
## Insurance_Percent 2.904e-02 8.798e-03 3.301 0.000978 ***
## Male_Percent 1.415e-01 1.632e-02 8.670 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01094 on 2553 degrees of freedom
## (137 observations deleted due to missingness)
## Multiple R-squared: 0.1531, Adjusted R-squared: 0.1517
## F-statistic: 115.3 on 4 and 2553 DF, p-value: < 2.2e-16
anova(Death.Factor.lm2)
## Analysis of Variance Table
##
## Response: Death_Rate
## Df Sum Sq Mean Sq F value Pr(>F)
## Median.Age 1 0.032909 0.032909 274.834 < 2.2e-16 ***
## Median.Income 1 0.008848 0.008848 73.898 < 2.2e-16 ***
## Insurance_Percent 1 0.004489 0.004489 37.488 1.062e-09 ***
## Male_Percent 1 0.009000 0.009000 75.160 < 2.2e-16 ***
## Residuals 2553 0.305695 0.000120
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We will take this model as our final model.
Death.Factor.lm=Death.Factor.lm2
==> Median Age, Median Income, Insurance Percent and Male Percent have a significant effect on Death Rate.
3. MULTICOLLINEARITY
detach("package:dplyr", unload=TRUE)  # avoid name clashes (e.g. dplyr::recode vs car::recode)
library(car)
vif(Death.Factor.lm)
## Median.Age Median.Income Insurance_Percent Male_Percent
## 1.031513 1.096595 2.907290 2.807764
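All VIFs above are below the common rule-of-thumb cutoffs of 5 (or 10), so multicollinearity does not look severe here. For reference, VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing the j-th regressor on the others; a sketch on simulated data (names invented) shows what car::vif() measures:

```r
# Compute one VIF by hand: regress x1 on the other regressors and
# invert (1 - R^2). A high VIF means x1 is nearly a linear combination
# of the others.
set.seed(2)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.5)   # deliberately collinear with x1
x3 <- rnorm(100)
r2_1   <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_1)
vif_x1                             # inflated well above 1 by the x1-x2 correlation
```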
4. INFLUENTIAL OBSERVATIONS
First, we look for high-leverage points.
finaldata.noNA = na.omit(finaldata)
X <- cbind(rep(1, nrow(finaldata.noNA)), finaldata.noNA$Median.Age,
           finaldata.noNA$Median.Income, finaldata.noNA$Male_Percent)
hii <- diag(X %*% solve(t(X) %*% X) %*% t(X))   # leverage values h_ii
p<-ncol(X)
n<-nrow(X)
which(hii>2*p/n)
## [1] 52 59 66 67 68 69 72 74 75 77 84 129 130 163 166
## [16] 175 177 194 197 199 219 224 226 228 231 237 250 262 263 267
## [31] 272 275 276 277 285 287 292 309 311 313 314 316 333 336 345
## [46] 370 372 382 395 401 408 428 438 441 443 461 464 481 502 519
## [61] 542 546 619 756 786 790 813 820 855 857 862 901 903 908 916
## [76] 921 948 962 966 974 1015 1017 1019 1021 1023 1025 1026 1028 1030 1031
## [91] 1038 1045 1046 1047 1051 1052 1057 1084 1085 1095 1106 1112 1117 1137 1192
## [106] 1201 1212 1215 1221 1226 1240 1255 1263 1270 1271 1301 1311 1341 1352 1403
## [121] 1414 1443 1477 1479 1482 1490 1492 1493 1494 1498 1499 1512 1514 1523 1557
## [136] 1567 1570 1578 1586 1652 1671 1693 1704 1709 1725 1765 1792 1815 1843 1867
## [151] 1899 1904 1916 1933 1987 2002 2007 2029 2052 2062 2069 2084 2087 2095 2111
## [166] 2114 2128 2132 2138 2141 2146 2147 2152 2162 2166 2173 2188 2193 2195 2198
## [181] 2208 2224 2228 2233 2242 2244 2251 2254 2266 2269 2286 2287 2293 2296 2298
## [196] 2301 2303 2305 2307 2311 2314 2318 2323 2328 2338 2346 2348 2353 2354 2355
## [211] 2358 2360 2361 2365 2367 2373 2388 2400 2409 2421 2464 2527 2529
library(olsrr)
ols_plot_cooksd_chart(Death.Factor.lm)
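The olsrr chart is based on the same quantity as base R's cooks.distance(); a small sketch on simulated data with a planted influential point (names invented):

```r
# Cook's distance flags points that would move the fitted line noticeably
# if deleted (high leverage combined with a large residual).
set.seed(3)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
x[50] <- 6; y[50] <- -10          # plant a high-leverage, large-residual point
fit <- lm(y ~ x)
D <- cooks.distance(fit)
which(D > 4 / length(D))          # common rule of thumb: flag D_i > 4/n
```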
5. RESIDUAL ANALYSIS
hist(Death.Factor.lm$residuals)
plot(density(Death.Factor.lm$residuals))
qqnorm(Death.Factor.lm$residuals)
linear.student.resid = rstudent(Death.Factor.lm)
plot(Death.Factor.lm$fitted,linear.student.resid)
title("Studentized residuals versus predicted")
The residual histogram shows a high frequency (>1500) of residuals between -0.01 and 0.00. The mass is concentrated just below zero with a long tail of large positive residuals, so the distribution is right-skewed: most counties have slightly lower death rates than predicted, and a few are far above. The QQ plot also shows that the residuals do not follow a normal distribution.
==> We will try transforming the response: y = ln(y)
Variance stabilizing transformations
ln.Death_Rate=log(finaldata$Death_Rate)
linear.ln.Death = lm(ln.Death_Rate~Median.Age+Median.Income+Insurance_Percent+Male_Percent+Poverty_Percent, data=finaldata)
plot(linear.ln.Death)
linear.ln.Death.student.resid = rstudent(linear.ln.Death)
plot(linear.ln.Death$fitted.values, linear.ln.Death.student.resid)
title("Studentized residuals versus fitted values")
hist(linear.ln.Death$residuals)
plot(density(linear.ln.Death$residuals))
qqnorm(linear.ln.Death$residuals)
With the data transformed from y to ln(y):
- Residuals vs Fitted: the residuals are more evenly spread, with no visible pattern; the model looks more linear.
- Normal QQ plot, histogram and density: the residuals follow a normal distribution more closely.
- Cook's distance chart: with a threshold of 0.05, there are 2 influential points, 428 and 2365.
==> The log transformation worked to stabilize the variance.
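A sketch of why the log transform helps (simulated data, invented names): when errors are multiplicative, y = mu * exp(e), taking logs gives log(y) = log(mu) + e, an additive error with constant variance.

```r
# On the raw scale the residual spread grows with the fitted value;
# after log-transforming the response it stays roughly constant.
set.seed(4)
x <- runif(300, 1, 10)
y <- exp(0.5 * x) * exp(rnorm(300, sd = 0.3))  # multiplicative noise
raw    <- lm(y ~ x)
logged <- lm(log(y) ~ x)
cor(abs(resid(raw)), fitted(raw))       # spread tracks the fitted value
cor(abs(resid(logged)), fitted(logged)) # much weaker relationship
```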
From our data analysis, we have the following results:
In conclusion, we find that Incidence Rate, Median Age, Median Income, Insurance Percent and Male Percent affect the lung cancer mortality rate for counties in the USA. However, the model can still be improved: we recommend looking for additional factors, and the data deserve further investigation, perhaps splitting counties by some criterion (e.g., small vs. large population) to fit each group better.